Abstract

Much recent work in bioinformatics has focused on the inference of various types of biological networks, 
representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting
involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly 
characterized by multiple, heterogeneous data sets (protein sequence, gene expression, etc.). Here, we 
distinguish between two modes of inference in this setting: direct inference based upon similarities 
between nodes joined by an edge, and indirect inference based upon similarities between one pair of 
nodes and another pair of nodes. We propose a supervised approach for the direct case by translating it 
into a distance metric learning problem. A relaxation of the resulting convex optimization problem leads 
to the support vector machine (SVM) algorithm with a particular kernel for pairs, which we call the 
metric learning pairwise kernel. We demonstrate, using several real biological networks, that this direct 
approach often improves upon the state-of-the-art SVM for indirect inference with the tensor product 
pairwise kernel. 

1 Introduction 

Increasingly, molecular and systems biology is concerned with describing various types of subcellular networks.
These include protein-protein interaction networks, metabolic networks, gene regulatory and signaling
pathways, and genetic interaction networks. While some of these networks can be partly deciphered by
high-throughput experimental methods, fully constructing any such network requires lengthy biochemical 
validation. Therefore, the automatic prediction of edges from other available data, such as protein sequences, 
global network topology or gene expression profiles, is of importance, either to speed up the elucidation of 
important pathways or to complement high-throughput methods that are subject to high levels of noise [23]. 

Edges in a network can be inferred from relevant data in at least two complementary ways. For concreteness,
consider a network of protein-protein interactions derived from some noisy, high-throughput technology.
Our confidence in the correctness of a particular edge A-B in this network increases if we observe, for example,
that the two proteins A and B localize to the same cellular compartment or share similar evolutionary
patterns |19[I17[[T^ . Generally, in this type of direct inference, two genes or proteins are predicted to interact 
if they bear some direct similarity to each other in the available data. 

An alternative mode of inference, which we call indirect inference, relies upon similarities between pairs 
of genes or proteins. In the example above, our confidence in A-B increases if we find some other, high-confidence
edge C-D such that the pair {A, B} resembles {C, D} in some meaningful fashion. Note that in
this model, the two connected proteins A and B might not be similar to one another. For example, if the goal
is to detect edges in a regulatory network by using time series expression data, one would expect the time 
series of the regulated protein to be delayed in time compared to that of the regulatory protein. Therefore, 
in this case, the learning phase would involve learning this feature from other pairs of regulatory/regulated 
proteins. The most common application of the indirect inference approach in the case of protein-protein 
interaction involves comparing the amino acid sequences of A and B versus C and D (e.g., |21llSl [ni l5]). 

Indirect inference amounts to a straightforward application of the machine learning paradigm to the 
problem of edge inference: each edge is an example, and the task is to learn by example to discriminate 
between "true" and "false" edges. Not surprisingly, therefore, several machine learning algorithms have been 
applied to predict network edges from properties of protein pairs. For example, in the context of machine 
learning with support vector machines (SVM) and kernel methods, Ben-Hur and Noble describe how to 
map an embedding of individual proteins onto an embedding of pairs of proteins. The mapping defines two 
pairs of proteins as similar to each other when each protein in a pair is similar to one corresponding protein 
in the other pair. In practice, the mapping is defined by deriving a kernel function on pairs of proteins from 
a kernel function Kg on individual proteins, obtained by a tensorization of the initial feature space. We 
therefore call this pairwise kernel, shown below, the tensor product pairwise kernel (TPPK): 

$$K_{TPPK}\big((x_1,x_2),(x_3,x_4)\big) = K_g(x_1,x_3)\,K_g(x_2,x_4) + K_g(x_1,x_4)\,K_g(x_2,x_3). \tag{1}$$
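A minimal sketch of equation (1) in Python may help make the construction concrete. It assumes a plain inner product as the base kernel on single proteins; the function names `linear_kernel` and `tppk` and the toy vectors are ours, chosen for illustration only.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def linear_kernel(a: Vector, b: Vector) -> float:
    """Inner product; a stand-in for the base kernel K_g on single proteins."""
    return sum(ai * bi for ai, bi in zip(a, b))


def tppk(x1: Vector, x2: Vector, x3: Vector, x4: Vector,
         k_g: Kernel = linear_kernel) -> float:
    """TPPK between the pairs (x1, x2) and (x3, x4), following equation (1)."""
    return k_g(x1, x3) * k_g(x2, x4) + k_g(x1, x4) * k_g(x2, x3)


# Toy feature vectors (made up for illustration).
a, b = [1.0, 0.0], [0.0, 2.0]
c, d = [1.0, 1.0], [3.0, 0.0]

print(tppk(a, b, c, d))  # 1*0 + 3*2 = 6.0
print(tppk(b, a, c, d))  # same value: the order within a pair does not matter
```

Summing the two products is what makes the kernel symmetric in the two members of each pair, so it can be used for undirected edges.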

Less attention has been paid to the use of machine learning approaches in the direct inference paradigm. 
Two exceptions are the works of Yamanishi et al. [25] and Vert et al. [22], who derive supervised machine
learning algorithms to optimize the measure of similarity that underlies the direct approach by learning
from examples of interacting and non-interacting pairs. Yamanishi et al. employ kernel canonical correlation
analysis to embed the proteins into a feature space where distances are expected to correlate with the presence
or absence of interactions between protein pairs. Vert et al. highlight the similarity of this approach with
the problem of distance metric learning [24], while proposing an algorithm for that purpose.

Both of these direct inference approaches, however, suffer from two important drawbacks. First, they are 
based on the optimization of a proxy function that is slightly different from the objective of the embedding, 
namely, finding a distance metric such that interacting/non-interacting pairs fall above/below some threshold. 
Second, the methods of [25] and [22] are applicable only when the known part of the network used for training
is defined by a subset of proteins in the network. In other words, in order to apply these methods, we must 
have a complete set of high-confidence edges for one set of proteins, from which we can infer edges in the 
rest of the network. This setting is unrealistic. In practice, our training data will generally consist of known 
positive and negative edges distributed throughout the target network. 

In this paper we propose a convex formulation for supervised learning in the direct inference paradigm 
that overcomes both of the limitations mentioned above. We show that a slight relaxation of this formulation 
bears surprising similarities with the supervised approach of Ben-Hur and Noble, in the sense that it amounts to defining a
kernel between pairs of proteins from a kernel between individual proteins. We therefore call our method 
the metric learning pairwise kernel (MLPK). An important property of this formulation as an SVM is the 
possibility to learn from several data types simultaneously by combining kernels, which is of particular 
importance in various bioinformatics applications [16, 12].

We validate the MLPK approach on the task of reconstructing two yeast networks: the network of 
metabolic pathways and the co-complex network. In each case, the network is inferred from a variety of 
genomic and proteomic data, including protein amino acid sequences, gene expression levels over a large set 
of experiments, and protein cellular localization. We show that the MLPK approach nearly always provides 
better prediction performance than the state-of-the-art TPPK approach. 

2 Algorithm 

Let us assume that a gene is represented by a vector $x \in \mathbb{R}^p$ of genomic data such as a microarray expression
profile. The problem of supervised gene network inference is, given a set of n genes $x_1, \ldots, x_n$ and a training
set $\mathcal{T} = \mathcal{I} \cup \mathcal{N} \subset [1,n]^2$ of interacting ($\mathcal{I}$) and non-interacting ($\mathcal{N}$) pairs, to predict whether pairs of genes
not in the training set interact or not. Following the direct inference approaches discussed above, we note that one way to solve this problem
is to learn a distance metric d between genes with the property that pairs of nearby genes with respect to d
are connected by an edge, while pairs of genes far from each other are not. If such a metric is available, then
the prediction of an edge between a candidate pair of genes simply amounts to computing their distance to
each other and predicting an edge if the distance is below a threshold.

2.1 Distance metric learning 

More formally, let us investigate distance metrics obtained by linear transformations of the input space. 
Such metrics are indexed by symmetric positive semidefinite matrices M as follows: 

$$d_M(x, x') = (x - x')^\top M (x - x').$$
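As a small illustration of this family of metrics, the sketch below evaluates the quadratic form for a given matrix and applies the thresholding rule described above. The matrix `M`, the threshold values and the helper names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from typing import List, Sequence


def d_M(x: Sequence[float], y: Sequence[float], M: List[List[float]]) -> float:
    """d_M(x, y) = (x - y)^T M (x - y) for a symmetric matrix M given as rows."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    M_diff = [sum(m * d for m, d in zip(row, diff)) for row in M]
    return sum(d * md for d, md in zip(diff, M_diff))


def predict_edge(x: Sequence[float], y: Sequence[float],
                 M: List[List[float]], threshold: float) -> bool:
    """Predict an edge when the learned distance falls below the threshold."""
    return d_M(x, y, M) < threshold


M = [[2.0, 0.0], [0.0, 0.5]]        # hypothetical learned metric
x, y = [1.0, 2.0], [0.0, 0.0]
print(d_M(x, y, M))                  # 2*1 + 0.5*4 = 4.0
print(predict_edge(x, y, M, 5.0))    # True
print(predict_edge(x, y, M, 3.0))    # False
```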

Our goal is to learn a distance metric that separates interacting from non-interacting pairs while controlling
over-fitting to the training set. Following the spirit of the SVM algorithm, we enforce an arbitrary margin
of 1 between the distances of interacting and non-interacting pairs, up to slack variables, and control the
Frobenius norm of M by considering the following problem:

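$$\min_{M,\gamma,\zeta}\ \|M\|_{Fro}^2 + C \sum_{(i,j)\in\mathcal{T}} \zeta_{ij} \tag{2}$$

under the constraints:

$$\begin{aligned}
\zeta_{ij} &\ge 0, & (i,j)&\in\mathcal{T},\\
d_M(x_i,x_j) &\le \gamma - 1 + \zeta_{ij}, & (i,j)&\in\mathcal{I},\\
d_M(x_i,x_j) &\ge \gamma + 1 - \zeta_{ij}, & (i,j)&\in\mathcal{N},\\
M &\succeq 0.
\end{aligned} \tag{3}$$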

In order to solve this problem we first prove the following extension to the representer theorem [10]:
Lemma 1 The solution of (2-3) can be expanded as:

$$M = \sum_{(i,j)\in\mathcal{T}} \alpha_{ij}\,(x_i - x_j)(x_i - x_j)^\top,$$

with $\alpha_{ij} \in \mathbb{R}$ for $(i,j) \in \mathcal{T}$.

Proof For any pair $(i,j)$, let us denote $u_{ij} = x_i - x_j$, and let $D_{ij}$ be the $p \times p$ matrix $D_{ij} = (x_i - x_j)(x_i - x_j)^\top = u_{ij}u_{ij}^\top$. Then we can rewrite

$$d_M(x_i,x_j) = \langle M, D_{ij}\rangle_{Fro},$$

where $\langle A, B\rangle_{Fro} = \mathrm{Trace}(A^\top B)$ is the Frobenius inner product. Introducing the hinge loss function
$L(y,y') = \max(1 - yy', 0)$ for $y,y' \in \mathbb{R}$, we can eliminate the slack variables and rewrite the problem (2-3) as:

$$\min_{\gamma,\ M \succeq 0}\ \|M\|_{Fro}^2 + C\sum_{(i,j)\in\mathcal{I}} L\big(\gamma - \langle M, D_{ij}\rangle_{Fro},\, 1\big) + C\sum_{(i,j)\in\mathcal{N}} L\big(\langle M, D_{ij}\rangle_{Fro} - \gamma,\, 1\big). \tag{4}$$

This shows that the optimization problem is in fact equivalent, up to the positive semidefiniteness constraint,
to an SVM in the linear space of symmetric matrices endowed with the Frobenius inner product.
Each edge example is then mapped to the matrix $D_{ij}$. In particular, if the constraint on M was not present,
then Lemma 1 would be exactly the representer theorem. Here we need to show that it still holds with the
constraint $M \succeq 0$. For this purpose let $M \succeq 0$ and $\gamma \in \mathbb{R}$ be the solution of (2-3). M can be uniquely
decomposed as $M = M_S + M_\perp$, where $M_S$ is in the linear span of $(D_{ij}, (i,j) \in \mathcal{T})$ and $\langle M_\perp, D_{ij}\rangle_{Fro} = 0$
for $(i,j) \in \mathcal{T}$. By the Pythagorean theorem we have $\|M\|_{Fro}^2 = \|M_S\|_{Fro}^2 + \|M_\perp\|_{Fro}^2$. Because the loss terms
depend on M only through $\langle M, D_{ij}\rangle_{Fro} = \langle M_S, D_{ij}\rangle_{Fro}$, if $M_\perp \neq 0$ the
functional minimized in (4) is strictly smaller at $(M_S, \gamma)$ than at $(M, \gamma)$; this would be a contradiction if
$M_S \succeq 0$. Therefore, to prove the lemma it suffices to show $M_S \succeq 0$. Let $v \in \mathbb{R}^p$ be any vector. We can
decompose that vector uniquely as $v = v_S + v_\perp$, where $v_S$ is in the linear span of the $(u_{ij}, (i,j) \in \mathcal{T})$ and
$v_\perp^\top u_{ij} = 0$ for $(i,j) \in \mathcal{T}$. We then have $M_S v_\perp = 0$ and $M_\perp v_S = 0$, and therefore

$$v^\top M_S v = v_S^\top M_S v_S = v_S^\top M_S v_S + v_S^\top M_\perp v_S = v_S^\top M v_S \ge 0,$$

where we used the fact that $M \succeq 0$ in the last inequality. This is true for any $v \in \mathbb{R}^p$, which shows that
$M_S \succeq 0$, concluding the proof. ■
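The two facts used at the start of the proof can be checked numerically: the distance equals a Frobenius inner product with the outer-product matrix of the difference vector, and the slack variables can be replaced by hinge losses. The sketch below is only an illustration under assumed toy data; the helper names (`outer`, `frobenius`, `objective`) and the values of `X`, `gamma` and `C` are ours.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def outer(u: Sequence[float]) -> Matrix:
    """Outer product u u^T, i.e. the matrix D for a difference vector u."""
    return [[ui * uj for uj in u] for ui in u]


def frobenius(A: Matrix, B: Matrix) -> float:
    """Frobenius inner product Trace(A^T B) = sum of entrywise products."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def d_M(x: Sequence[float], y: Sequence[float], M: Matrix) -> float:
    diff = [xi - yi for xi, yi in zip(x, y)]
    return sum(diff[r] * M[r][c] * diff[c]
               for r in range(len(diff)) for c in range(len(diff)))


def hinge(y: float, y_prime: float) -> float:
    """L(y, y') = max(1 - y y', 0)."""
    return max(1.0 - y * y_prime, 0.0)


def objective(M: Matrix, gamma: float, C: float, X: Sequence[Sequence[float]],
              interacting: Sequence[Tuple[int, int]],
              non_interacting: Sequence[Tuple[int, int]]) -> float:
    """Slack-free form: squared Frobenius norm plus hinge losses on both pair sets."""
    value = frobenius(M, M)
    for i, j in interacting:          # want d_M <= gamma - 1
        value += C * hinge(gamma - d_M(X[i], X[j], M), 1.0)
    for i, j in non_interacting:      # want d_M >= gamma + 1
        value += C * hinge(d_M(X[i], X[j], M) - gamma, 1.0)
    return value


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]      # toy data
identity = [[1.0, 0.0], [0.0, 1.0]]
u = [X[1][k] - X[2][k] for k in range(2)]     # [1, -3]

print(frobenius(identity, outer(u)), d_M(X[1], X[2], identity))  # 10.0 10.0
print(objective(identity, 2.0, 1.0, X, [(0, 1)], [(0, 2)]))      # 2.0 (no violation)
print(objective(identity, 1.0, 1.0, X, [(0, 1)], [(0, 2)]))      # 3.0 (one margin violation)
```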



2.2 Kernelization 

By plugging the result of Lemma 1 into (2-3) we see that this problem is equivalent to that of finding
$\alpha_{ij}$, $(i,j) \in \mathcal{T}$, and $\gamma$. In order to write out the problem explicitly, let us introduce the following kernel
between two pairs $(x_1,x_2)$ and $(x_3,x_4)$:

$$\begin{aligned}
K_{MLPK}\big((x_1,x_2),(x_3,x_4)\big) &= \langle D_{x_1x_2}, D_{x_3x_4}\rangle_{Fro}\\
&= \mathrm{Trace}\big((x_1-x_2)(x_1-x_2)^\top (x_3-x_4)(x_3-x_4)^\top\big)\\
&= \big[(x_1-x_2)^\top(x_3-x_4)\big]^2\\
&= \big(x_1^\top x_3 - x_1^\top x_4 - x_2^\top x_3 + x_2^\top x_4\big)^2.
\end{aligned} \tag{5}$$

This kernel is positive definite because it is the Frobenius inner product between the matrices $D_{ab}$ representing
the pairs. Moreover, although $K_{MLPK}$ is formally defined for ordered pairs only, we observe that it is
invariant by permutation of the elements of each pair (e.g., when $x_1$ and $x_2$ are flipped). It can therefore be
considered as a positive definite kernel over the set of unordered pairs, seen as the quotient space of the set
of ordered pairs with respect to the equivalence relation of permutation within each pair. We call this
kernel for unordered pairs the metric learning pairwise kernel (MLPK), hence the notation $K_{MLPK}$.
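The chain of equalities defining the MLPK, and its invariance under flipping the members of a pair, can be verified on a toy example. The sketch below assumes the linear case (raw vectors, no base kernel); the function names and the vectors are ours.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def mlpk_linear(x1, x2, x3, x4) -> float:
    """Squared inner product of the two difference vectors."""
    u = [a - b for a, b in zip(x1, x2)]
    v = [a - b for a, b in zip(x3, x4)]
    return dot(u, v) ** 2


def mlpk_trace(x1, x2, x3, x4) -> float:
    """Same quantity computed as Trace(D_12 D_34) from explicit outer products."""
    u = [a - b for a, b in zip(x1, x2)]
    v = [a - b for a, b in zip(x3, x4)]
    D12: List[List[float]] = [[ui * uj for uj in u] for ui in u]
    D34: List[List[float]] = [[vi * vj for vj in v] for vi in v]
    p = len(u)
    return sum(D12[r][c] * D34[c][r] for r in range(p) for c in range(p))


x1, x2 = [1.0, 2.0], [0.0, 0.0]
x3, x4 = [3.0, 1.0], [0.0, 0.0]

print(mlpk_linear(x1, x2, x3, x4))   # (1*3 + 2*1)^2 = 25.0
print(mlpk_trace(x1, x2, x3, x4))    # 25.0
print(mlpk_linear(x2, x1, x3, x4))   # 25.0: flipping a pair changes the sign before squaring
```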

In order to express the problem (2-3) in terms of the $\alpha$ variables provided by Lemma 1 we need to
express the constraint $M \succeq 0$ in terms of $\alpha$. Denoting pairs of indices $t = (i,j)$, Lemma 1 ensures that M
can be written as $M = \sum_{t\in\mathcal{T}} \alpha_t u_t u_t^\top$. As we showed in the proof of Lemma 1 this implies that M is null
on the space orthogonal to the linear span of $(u_t, t \in \mathcal{T})$. Therefore, $M \succeq 0$ if and only if $v^\top M v \ge 0$ for
any v in the linear span of $(u_t, t \in \mathcal{T})$. This is equivalent to the fact that the $|\mathcal{T}| \times |\mathcal{T}|$ matrix F defined
by $F_{t,t'} = u_t^\top M u_{t'}$ is positive semidefinite. Finally, if we denote by $F_t$ the $|\mathcal{T}| \times |\mathcal{T}|$ matrix whose $(t_1, t_2)$
entry is $u_{t_1}^\top D_t u_{t_2} = u_{t_1}^\top u_t u_t^\top u_{t_2}$, this is equivalent to $\sum_{t\in\mathcal{T}} \alpha_t F_t \succeq 0$.
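The identity behind this reformulation, namely that the matrix with entries $u_{t_1}^\top M u_{t_2}$ equals the weighted sum of the matrices $F_t$, can be checked directly. The following sketch uses made-up difference vectors `U` and coefficients `alpha`; both functions are illustrative helpers of ours.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def constraint_matrix(U: Sequence[Sequence[float]], alpha: Sequence[float]) -> Matrix:
    """sum_t alpha_t F_t, with F_t[t1][t2] = (u_t1 . u_t)(u_t . u_t2)."""
    n = len(U)
    G = [[dot(a, b) for b in U] for a in U]   # inner products of difference vectors
    return [[sum(alpha[t] * G[t1][t] * G[t][t2] for t in range(n))
             for t2 in range(n)] for t1 in range(n)]


def direct_matrix(U: Sequence[Sequence[float]], alpha: Sequence[float]) -> Matrix:
    """F[t1][t2] = u_t1^T M u_t2 with M = sum_t alpha_t u_t u_t^T built explicitly."""
    p = len(U[0])
    M = [[sum(a * u[r] * u[c] for a, u in zip(alpha, U)) for c in range(p)]
         for r in range(p)]
    M_u = [[sum(M[r][c] * u[c] for c in range(p)) for r in range(p)] for u in U]
    return [[dot(U[t1], M_u[t2]) for t2 in range(len(U))] for t1 in range(len(U))]


U = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]   # toy difference vectors u_t
alpha = [1.0, -0.5, 0.25]                  # toy expansion coefficients

A = constraint_matrix(U, alpha)
B = direct_matrix(U, alpha)
max_gap = max(abs(A[r][c] - B[r][c]) for r in range(3) for c in range(3))
print(A[0])      # [0.5, 0.0, -1.0]
print(max_gap)   # 0.0
```

Only inner products between the difference vectors enter `constraint_matrix`, which is what later allows the constraint to be written in terms of a kernel.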

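Plugging the representation of Lemma 1 into (2-3), and replacing the Frobenius inner product by the
MLPK kernel, we show that the problem is equivalent to

$$\min_{\alpha,\gamma,\zeta}\ \sum_{(i,j)\in\mathcal{T}}\ \sum_{(k,l)\in\mathcal{T}} \alpha_{ij}\alpha_{kl}\,K_{MLPK}\big((x_i,x_j),(x_k,x_l)\big) + C \sum_{(i,j)\in\mathcal{T}} \zeta_{ij}, \tag{6}$$

under the constraints:

$$\begin{aligned}
\zeta_{ij} &\ge 0, & (i,j)&\in\mathcal{T},\\
\sum_{(k,l)\in\mathcal{T}} \alpha_{kl}\,K_{MLPK}\big((x_i,x_j),(x_k,x_l)\big) &\le \gamma - 1 + \zeta_{ij}, & (i,j)&\in\mathcal{I},\\
\sum_{(k,l)\in\mathcal{T}} \alpha_{kl}\,K_{MLPK}\big((x_i,x_j),(x_k,x_l)\big) &\ge \gamma + 1 - \zeta_{ij}, & (i,j)&\in\mathcal{N},\\
\sum_{(k,l)\in\mathcal{T}} \alpha_{kl}F_{kl} &\succeq 0.
\end{aligned} \tag{7}$$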

An important property of this problem is that the data only appear through the kernel $K_{MLPK}$ and the
matrices $F_{ij}$. Furthermore, the MLPK kernel itself (5), computed between two pairs of vectors, only involves
inner products between the vectors; similarly the $(t_1,t_2)$-th entry of the matrix $F_t$ is a product of inner
products, which can easily be computed from the inner products of the data themselves. As a result, we can
apply the kernel trick to extend the problem (6-7) to any data space endowed with a positive definite kernel
$K_g$. The resulting MLPK kernel between pairs becomes

$$K_{MLPK}\big((x_1,x_2),(x_3,x_4)\big) = \big(K_g(x_1,x_3) - K_g(x_1,x_4) - K_g(x_2,x_3) + K_g(x_2,x_4)\big)^2,$$

and for any three pairs $t = (i,j)$, $t_1 = (i_1,j_1)$, $t_2 = (i_2,j_2)$ in $\mathcal{T}$ the entry $(t_1,t_2)$ of $F_t$ is

$$\big[K_g(x_{i_1},x_i) - K_g(x_{i_1},x_j) - K_g(x_{j_1},x_i) + K_g(x_{j_1},x_j)\big] \times \big[K_g(x_{i_2},x_i) - K_g(x_{i_2},x_j) - K_g(x_{j_2},x_i) + K_g(x_{j_2},x_j)\big].$$
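In practice the kernelized MLPK can be computed from a precomputed Gram matrix of the base kernel, indexing proteins by integer. The sketch below assumes such a matrix `K` is available as a list of rows; the function names and the toy matrix (a linear kernel on three made-up vectors) are ours.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]


def mlpk_from_gram(K: Sequence[Sequence[float]], pair_a: Pair, pair_b: Pair) -> float:
    """Kernelized MLPK between two index pairs, given base-kernel values K[i][k]."""
    i, j = pair_a
    k, l = pair_b
    return (K[i][k] - K[i][l] - K[j][k] + K[j][l]) ** 2


def mlpk_gram_matrix(K: Sequence[Sequence[float]],
                     pairs: Sequence[Pair]) -> List[List[float]]:
    """Pairwise kernel matrix over a list of candidate edges."""
    return [[mlpk_from_gram(K, s, t) for t in pairs] for s in pairs]


# Base Gram matrix of the linear kernel on the toy vectors [1,2], [0,0], [3,1].
K = [[5.0, 0.0, 5.0],
     [0.0, 0.0, 0.0],
     [5.0, 0.0, 10.0]]
pairs = [(0, 1), (2, 1)]

print(mlpk_gram_matrix(K, pairs))   # [[25.0, 25.0], [25.0, 100.0]]
```

A matrix of this form is what one would pass to an SVM implementation that accepts precomputed kernels; its size grows with the square of the number of training pairs.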

2.3 Relaxation 

The problem (6-7) is a convex problem over the cone of positive semidefinite matrices that can in theory
be solved by algorithms such as interior-point methods. The dimension of this problem, however, is
$2|\mathcal{T}| + 1$. This is typically of the order of several thousands for small biological networks with a few
hundreds or thousands of vertices, which poses serious convergence issues for general-purpose optimization
software.

If we relax the condition $M \succeq 0$ in the original problem, then it becomes the quadratic program of the
SVM, for which dedicated optimization algorithms have been developed: current implementations of SVM
easily handle several tens of thousands of dimensions [21]. The obvious drawback of this relaxation is that
if the matrix M is not positive semidefinite, then it does not define a metric. Although this can be a serious
problem for classical applications of distance metric learning such as clustering [21], we note that in our case
the goal of metric learning is just to provide a decision function $f(x, x') = d_M(x, x')$ for predicting connected
pairs, and negativity of this decision function is not a problem in itself. Therefore, we propose to relax the
constraint $M \succeq 0$, or equivalently $\sum_{(k,l)\in\mathcal{T}} \alpha_{kl}F_{kl} \succeq 0$ in (7), and to solve the initial problem using an SVM
over pairs with the MLPK kernel.
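For reference, the relaxed program can be written out by dropping the semidefinite constraint from the kernelized problem. The display below is a sketch in the notation of this section. The expression for the decision function follows from the expansion of M in Lemma 1; the prediction rule (an edge when the score falls below $\gamma$) is our reading of the thresholding described at the start of Section 2, not a formula stated explicitly in the text.

```latex
\begin{aligned}
\min_{\alpha,\gamma,\zeta}\quad
  & \sum_{(i,j)\in\mathcal{T}}\ \sum_{(k,l)\in\mathcal{T}}
    \alpha_{ij}\alpha_{kl}\,K_{MLPK}\big((x_i,x_j),(x_k,x_l)\big)
    + C\sum_{(i,j)\in\mathcal{T}}\zeta_{ij} \\
\text{s.t.}\quad
  & \zeta_{ij}\ge 0, && (i,j)\in\mathcal{T},\\
  & \sum_{(k,l)\in\mathcal{T}}\alpha_{kl}\,K_{MLPK}\big((x_i,x_j),(x_k,x_l)\big)
    \le \gamma - 1 + \zeta_{ij}, && (i,j)\in\mathcal{I},\\
  & \sum_{(k,l)\in\mathcal{T}}\alpha_{kl}\,K_{MLPK}\big((x_i,x_j),(x_k,x_l)\big)
    \ge \gamma + 1 - \zeta_{ij}, && (i,j)\in\mathcal{N}.
\end{aligned}

f(x,x') = d_M(x,x') = \langle M, D_{xx'}\rangle_{Fro}
        = \sum_{(k,l)\in\mathcal{T}} \alpha_{kl}\,K_{MLPK}\big((x,x'),(x_k,x_l)\big),
\qquad \text{predict an edge when } f(x,x') < \gamma .
```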

3 Experiments 

We present below a comparison of the previously described TPPK kernel and the new MLPK kernel for the 
reconstruction of two biological networks: the metabolic network and the co-complex protein network. For 
each network, we cast the problem of network reconstruction as a binary classification problem, where the 
presence or absence of edges must be inferred from various types of data relevant to the proteins. Because the 
network contains relatively few edges compared to the total number of possible pairs, we created a balanced 
dataset by keeping all known edges as positive examples and randomly sampling an equal number of absent 
edges as negative examples. We compare the utilities of the TPPK and MLPK kernels in this context by 
assessing the performance of an SVM for edge prediction in a five-fold cross-validation experiment repeated 
three times (3x5cv) with different random folds. At each fold, the regularization parameter C of the SVM 
is chosen among 18 values evenly log-spaced on the interval [10~*, 50] by minimizing the classification error 
estimated by five-fold cross-validation within the training set only. 
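The data preparation steps of this protocol can be sketched as follows. This is an illustration under stated assumptions, not the code used for the experiments: the helper names are ours, the toy network is made up, and the lower end of the grid for C is set to a placeholder value of 1e-4 because the exponent is not legible in our copy of the text. The inner cross-validation that selects C within each training fold is omitted.

```python
import itertools
import random
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[int, int]


def balanced_pairs(n_nodes: int, edges: Sequence[Pair],
                   rng: random.Random) -> Tuple[List[Pair], List[Pair]]:
    """Keep every known edge as a positive; sample as many absent edges as negatives."""
    positives = sorted({(min(i, j), max(i, j)) for i, j in edges})
    known = set(positives)
    absent = [p for p in itertools.combinations(range(n_nodes), 2) if p not in known]
    return positives, rng.sample(absent, len(positives))


def log_spaced(low: float, high: float, num: int) -> List[float]:
    """`num` values evenly spaced on a log scale from low to high inclusive."""
    ratio = high / low
    return [low * ratio ** (k / (num - 1)) for k in range(num)]


def repeated_kfold(n_examples: int, k: int, repeats: int,
                   rng: random.Random) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for `repeats` independent k-fold splits."""
    for _ in range(repeats):
        order = list(range(n_examples))
        rng.shuffle(order)
        folds = [order[f::k] for f in range(k)]
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, folds[f]


rng = random.Random(0)
positives, negatives = balanced_pairs(6, [(0, 1), (1, 2), (2, 3)], rng)
C_grid = log_spaced(1e-4, 50.0, 18)   # lower bound is a placeholder assumption
splits = list(repeated_kfold(len(positives) + len(negatives), k=5, repeats=3, rng=rng))
print(len(positives), len(negatives), len(C_grid), len(splits))   # 3 3 18 15
```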

3.1 Metabolic network 

Most biochemical reactions in living organisms are catalyzed by particular proteins called enzymes, and 
occur sequentially to form metabolic pathways. For example, the degradation of glucose into pyruvate 
(called glycolysis) involves a sequence of ten chemical reactions catalyzed by ten enzymes. The metabolic 
gene network is defined as an undirected graph with enzymes as vertices and with edges connecting pairs 
of enzymes that can catalyze successive chemical reactions. The reconstruction of metabolic pathways for 
various organisms is of critical importance, e.g., to find new ways to synthesize chemical compounds of 
interest. This problem motivated earlier work on supervised graph inference. Focusing on the
budding yeast S. cerevisiae, we collected the metabolic network and genomic data used in [21]. The network
was extracted from the KEGG database and contains 769 vertices and 3702 undirected edges.
Table 1: Performance on reconstruction of the yeast metabolic and co-complex networks. The
table lists, for each network and each type of data, the accuracy and area under the ROC curve obtained by
each pairwise kernel. Values in the table are means and standard errors in a 3x5cv experiment. TPPK is
the tensor product pairwise kernel, and MLPK is the metric learning pairwise kernel.

Network     Data                  MLPK Accuracy   MLPK AUC      TPPK Accuracy   TPPK AUC
Metabolic   Expression            77.8 ± 0.2      84.5 ± 0.1    76.7 ± 0.3      83.3 ± 0.2
            Localization          63.9 ± 0.4      68.2 ± 0.4    62.3 ± 0.1      65.8 ± 0.4
            Phylogenetic profile  79.8 ± 0.1      84.9 ± 0.2    78.4 ± 0.1      83.4 ± 0.4
            Yeast two-hybrid      76.6 ± 0.2      82.0 ± 0.1    59.2 ± 0.1      65.1 ± 0.6
            Sum                   83.9 ± 0.4      90.9 ± 0.3    84.2 ± 0.5      91.1 ± 0.3
Co-complex  Localization          76.5 ± 0.1      76.8 ± 0.1    79.6 ± 0.1      83.1 ± 0.1
            ChIP-chip             82.4 ± 0.3      89.7 ± 0.1    63.8 ± 0.1      67.9 ± 0.3
            Pfam                  92.2 ± 0.2      98.2 ± 0.1    85.5 ± 0.1      91.7 ± 0.2
            PSI-BLAST             90.0 ± 0.3      97.3 ± 0.1    88.3 ± 0.1      93.6 ± 0.2
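The quantities reported in Table 1 can be computed along the following lines. This is a sketch only: the text does not say how the standard errors were aggregated over the 3x5cv runs, so the standard error of the mean over a list of per-run values is an assumption, and the labels and scores are made up.

```python
import math
from typing import List, Sequence, Tuple


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.0) -> float:
    """Fraction of examples whose thresholded score matches the label (1 = edge)."""
    hits = sum((s > threshold) == (y == 1) for y, s in zip(labels, scores))
    return hits / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean (unbiased variance estimate)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


labels: List[int] = [1, 1, 0, 0]
scores: List[float] = [0.9, 0.4, 0.5, 0.1]

print(roc_auc(labels, scores))                      # 0.75
print(accuracy(labels, scores, threshold=0.45))     # 0.5
print(mean_and_standard_error([0.8, 0.9, 1.0]))     # mean near 0.9, s.e. near 0.058
```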



In order to infer the network, various independent data about the proteins can be used. In this experiment, 
we use four relevant sources of data provided by [25]: (1) a set of 157 gene expression measurements obtained
from DNA microarrays; (2) the phylogenetic profiles of the genes, represented as 145-bit vectors indicating
the presence or absence of each gene in 145 fully sequenced genomes; (3) the protein's localization in the cell
determined experimentally [5], represented as 23-bit vectors corresponding to 23 cellular compartments; and
(4) yeast two-hybrid protein-protein interaction data, represented as a network. For the first three data
sets, a Gaussian RBF kernel was used to represent the data as a kernel matrix. For the yeast two-hybrid
network, we use a diffusion kernel. Additionally, we considered a fifth kernel obtained by summing the
first four kernels. This is a simple approach to data integration that has proved useful in previous work.
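A sketch of the two simplest ingredients, the Gaussian RBF kernel matrix and the sum of kernel matrices, is given below. The bandwidth `sigma` is a hypothetical parameter (the text does not report the value used), the toy vectors are made up, and the diffusion kernel is not shown.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def rbf_gram(X: Sequence[Sequence[float]], sigma: float = 1.0) -> Matrix:
    """Gaussian RBF Gram matrix: K[a][b] = exp(-||x_a - x_b||^2 / (2 sigma^2))."""
    def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return [[math.exp(-sq_dist(u, v) / (2.0 * sigma ** 2)) for v in X] for u in X]


def sum_kernels(grams: Sequence[Matrix]) -> Matrix:
    """Entrywise sum of Gram matrices defined over the same proteins."""
    n = len(grams[0])
    return [[sum(K[r][c] for K in grams) for c in range(n)] for r in range(n)]


expression = [[0.0, 0.0], [1.0, 0.0]]      # toy expression profiles
localization = [[1.0, 0.0], [1.0, 0.0]]    # toy 2-bit localization vectors

K_expr = rbf_gram(expression)
K_loc = rbf_gram(localization)
K_sum = sum_kernels([K_expr, K_loc])

print(K_expr[0][1])   # exp(-0.5), about 0.6065
print(K_sum[0][0])    # 2.0
print(K_sum[0][1])    # 1 + exp(-0.5), about 1.6065
```

A sum of positive definite kernels is again positive definite, so the summed matrix can be used as a base kernel in either pairwise kernel.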

Table 1 (top) shows the performance of each pairwise kernel for the five data sets. The MLPK is never
worse than the TPPK kernel beyond the reported standard errors. The two kernels have similar performance on the sum kernel; MLPK is slightly
better than TPPK on the expression, localization and phylogenetic profile kernels, and much better on the
yeast two-hybrid dataset (76.6% vs. 59.2% in accuracy).

Interestingly, we note that although connected pairs, i.e., pairs of enzymes acting successively in a pathway,
are expected to have similar expression, phylogenetic profiles and localization (explaining the good
performance of the MLPK on these datasets), the indirect approach implemented by the TPPK also gives
good results for these data. This result implies that for these data, interacting pairs in the training set are
often similar not only to each other but also to other interacting pairs in the training set. This observation is
not surprising because, for example, if two proteins of the test set are co-localized in a particular organelle,
it is likely that interacting pairs of proteins co-localized in the same organelle are also present in the training
set, because connected proteins can reside in only a limited number of organelles.

In the case of yeast two-hybrid data, on the other hand, the kernel between single proteins is defined as
a diffusion kernel over the yeast two-hybrid graph. One can speculate that, in that case, similarity between
the two proteins of a pair can be easily assessed and used by the MLPK to predict edges, but similarity between pairs as defined
by the TPPK kernel is less likely to be observed. In a sense, the dimensionality of the feature space of the
diffusion kernels is much larger than that defined by the other kernels, and a protein is only close to its
neighbors in the yeast two-hybrid graph.

3.2 Protein complex network 

Many proteins carry out their biological functions by acting together in multi-protein structures known as 
complexes. Understanding protein function therefore requires identification of these complexes. In the co-complex
network, nodes are proteins, and an edge between proteins A and B exists if A and B are members of
the same protein complex. Some high-throughput experimental methods, such as tandem affinity purification
followed by mass spectrometry, explicitly identify these co-complex relationships, albeit in a noisy fashion.
Also, computational methods exist for inferring the co-complex network from individual data types or from
multiple data types simultaneously. We derived the co-complex data set based on an intersection of
the manually curated MIPS complex catalogue and the BIND complex data set. The co-complex
network contains 3280 edges connecting 797 proteins. In addition, our data set contains 3081 proteins with
no co-complex relationships.

For this evaluation, we again use four different data sets that we consider relevant to the co-complex 
network. The first data set is the same localization data that we used above. The second is derived
from a chip-based version of the chromatin immunoprecipitation assay (so-called "ChIP-chip" data) [7].
This assay provides evidence that a transcription factor binds to the upstream region of a given gene and
is likely to regulate the expression of the given gene. Our data set contains data for 113 transcription
factors, and so yields a vector of length 113 for each protein. The final two data sets are derived from the
amino acid sequences of the yeast proteins. For the first, we compared each yeast protein to every model
in the Pfam database of protein domain HMMs (pfam.wustl.edu) and recorded the E-value of the match.
This comparison yields a vector of length 8183 for each protein. Finally, in a similar fashion, we compared
each yeast protein to each protein in the Swiss-Prot database version 40 (ca.expasy.org/sprot) using
PSI-BLAST, yielding vectors of length 101,602. Each of the four data sets is represented using a scalar
product kernel.

We used the same experimental procedure to compare the quality of edge predictors for the co-complex 
network using MLPK and TPPK. The results, shown in Table 1 (bottom), again show the value of the
MLPK approach. Using either performance metric (accuracy or ROC area), the MLPK approach performs 
better than the TPPK approach on three out of four data sets. 

Most striking is the improvement for the ChIP-chip data set (accuracy from 63.8% to 82.4%). This result
is expected, because we know that proteins in the same complex must act in concert. As such, they are 
typically regulated by a common set of transcription factors. 

In contrast, the MLPK approach does not fare better than TPPK on the localization data set. This
is, at first, surprising because two proteins must co-localize in order to participate in a common complex.
This problem is thus an example of the direct inference case for which the MLPK is designed. However, the
localization data is somewhat complex because (1) only approximately 70% of yeast proteins are assigned
any localization at all, and (2) many proteins are assigned to multiple locations. As a result, among 3280
positive edges in the training set, only 1852 (56%) of those protein pairs share exactly the same localization.
Furthermore, 550 (16.8%) of the 3280 negative edges used in training connect proteins with the same
localization, primarily "Unknown." These factors make direct inference using this data set difficult. The
indirect method, by contrast, is apparently able to identify useful relationships, corresponding to specific 
localizations, that are enriched among the positive pairs relative to the negative pairs. 

4 Discussion 

We showed how a particular formulation of distance metric learning for graph inference can be cast
as a convex optimization problem and can be applied to any data set endowed with a positive definite
kernel. A relaxation of this problem leads to the SVM algorithm with the new MLPK kernel between
pairs. Experiments on two biological networks confirm the value of this approach for the reconstruction of
biological networks from heterogeneous genomic and proteomic data.

Beyond the direct and indirect approaches to graph inference mentioned in the introduction, there exist
many alternative ways to infer networks, such as estimating conditional independence between vertices with
Bayesian networks. An interesting property of methods based on supervised learning, such as the SVM
with the TPPK and MLPK kernels, is that they make few assumptions about the nature of the edges; the only
hypothesis made is that there is information related to the presence or absence of edges in the data, and we
let the learning algorithm model this information. The good accuracy obtained on two completely different
networks (metabolic and co-complex) supports the general utility of this approach.

An interesting and important avenue for future research is the extension of these approaches to inference 
of directed graphs, e.g., regulatory networks. Although the TPPK and MLPK approaches are not adapted 
as such to this problem, variants involving for example kernels between ordered pairs could be studied. 